
Add leadership "domains" so multiple Rivers can operate in one schema#1113

Open
brandur wants to merge 1 commit into master from brandur-leadership-realms

Conversation

@brandur
Contributor

@brandur brandur commented Dec 23, 2025

We've gotten a couple of requests so far (see #342 and #1105) to be
able to start multiple River clients targeting different queues within
the same database/schema, and to give them the capacity to operate
independently enough to be functional. This currently isn't possible
because a single leader is elected per schema, and it handles all
maintenance operations, including non-queue ones like periodic job
enqueuing.

Here, add the idea of a LeaderDomain. This lets a user set the
"domain" on which a client will elect its leader, allowing multiple
leaders to be elected in a single schema. Each leader runs its own
maintenance services.

Setting LeaderDomain has the additional effect of making maintenance
services operate only on the queues that their client is configured
for. The idea is to preserve backwards compatibility, in that the
default behavior (with LeaderDomain unset) is unchanged, while
providing a path for multiple leaders to interoperate with each other.

There are still a few edges: for example, reindexing is not
queue-specific, so multiple leaders could each be running a reindexer.
I've provided guidance in the config documentation that ideally, all
clients but one should have their reindexer disabled.

@brandur brandur requested a review from bgentry December 23, 2025 22:21
@brandur
Contributor Author

brandur commented Dec 23, 2025

@bgentry This works (I think), but still needs some testing added. Wanted to get your general reaction before taking it all the way.

@brandur
Contributor Author

brandur commented Dec 24, 2025

Okay, the tests for this should be in good shape now.

@@ -0,0 +1,3 @@
ALTER TABLE /* TEMPLATE: schema */river_leader
Contributor Author

Since we're adding a migration here, before shipping I'd also go through the existing "needs migration" issues and pull them in.

One nice thing about this migration as currently written is that if you did not run it, it wouldn't be a problem as long as you didn't try to use LeaderDomain. So it's safer than your average migration.

@brandur brandur force-pushed the brandur-leadership-realms branch 2 times, most recently from ec7e682 to f41afcc Compare December 26, 2025 19:31
Comment thread client.go Outdated
// because the default client(s) will infringe on the domains of the
// non-default one(s).
//
// Certain maintenance services that aren't queue-related like the indexer
Contributor

Suggested change
// Certain maintenance services that aren't queue-related like the indexer
// Certain maintenance services that aren't queue-related like the reindexer

Contributor Author

Fixed! Thx.

Comment thread client.go
Comment on lines +247 to +249
// In general, most River users should not need LeaderDomain, and when
// running multiple Rivers may want to consider using multiple databases and
// multiple schemas instead.
Contributor

I would probably try to get some version of this warning into the first paragraph to dissuade people from using this feature. Could describe it as "an advanced option" or something like that. I would just like to ensure people don't start reaching for this setting automatically just because it's there, since it has some major footguns.

Contributor Author

Yep, great point. Fixed.

Comment thread client.go
Comment on lines +229 to +234
// A warning though that River *does not protect against configuration
// mistakes*. If client1 on domain1 is configured for queue_a and queue_b,
// and client2 on domain2 is *also* configured for queue_a and queue_b, then
// both clients may end up running maintenance services on the same queues
// at the same time. It's the caller's responsibility to ensure that doesn't
// happen.
Contributor

Gotta say this definitely feels dangerous. I'm wondering what else can break if we allow for breaking one of the main promises of leader election (there can only be one), including Pro features.

Contributor Author

Yeah, it's admittedly not super great. I think the only alternatives, though, involve storing something in the database so clients can compare notes with one another, which is also not great for other reasons.

Comment thread client.go Outdated
Comment on lines +907 to +908
// It's important for queuesIncluded to be `nil` in case it's not in use
// for the various driver queries to work correctly.
Contributor

maybe the driver layer should handle nil and []string{} equivalently to avoid needing to deal with that concern at this level?

Contributor Author

Yeah, good call. Modified the drivers to do that and added tests to each query in question.

Comment thread internal/leadership/elector.go Outdated
var sub *notifier.Subscription
if e.notifier == nil {
e.Logger.DebugContext(ctx, e.Name+": No notifier configured; starting in poll mode", "client_id", e.config.ClientID)
e.Logger.DebugContext(ctx, e.Name+": Resigned leadership successfully", "client_id", e.config.ClientID, "domain", e.config.Domain)
Contributor

was this log text altered by mistake?

Contributor Author

Yep, thx. Fixed.

Comment thread internal/leadership/elector.go Outdated
}

e.Logger.DebugContext(ctx, e.Name+": Current leader attempting reelect", "client_id", e.config.ClientID)
e.Logger.InfoContext(ctx, e.Name+": Current leader received forced resignation", "client_id", e.config.ClientID, "domain", e.config.Domain)
Contributor

the text here was also changed, I think accidentally

Contributor Author

Yep, thx. Fixed.

Comment thread internal/leadership/elector.go Outdated
// another client, even in the event of a cancellation.
func (e *Elector) attemptResignLoop(ctx context.Context) {
e.Logger.DebugContext(ctx, e.Name+": Attempting to resign leadership", "client_id", e.config.ClientID)
e.Logger.InfoContext(ctx, e.Name+": Current leader received forced resignation", "client_id", e.config.ClientID, "domain", e.config.Domain)
Contributor

I don't think this text is necessarily true, there are other reasons this could be called besides a forced resignation.

Contributor Author

Yeah, I think it's wrong. Reverted.

Comment on lines 93 to +107
 -- name: JobDeleteBefore :execresult
 DELETE FROM /* TEMPLATE: schema */river_job
-WHERE
-    id IN (
-        SELECT id
-        FROM /* TEMPLATE: schema */river_job
-        WHERE
+WHERE id IN (
+    SELECT id
+    FROM /* TEMPLATE: schema */river_job
+    WHERE (
         (state = 'cancelled' AND finalized_at < cast(@cancelled_finalized_at_horizon AS text)) OR
         (state = 'completed' AND finalized_at < cast(@completed_finalized_at_horizon AS text)) OR
         (state = 'discarded' AND finalized_at < cast(@discarded_finalized_at_horizon AS text))
-        ORDER BY id
-        LIMIT @max
-    )
-    -- This is really awful, but unless the `sqlc.slice` appears as the very
-    -- last parameter in the query things will fail if it includes more than one
-    -- element. The sqlc SQLite driver uses position-based placeholders (?1) for
-    -- most parameters, but unnamed ones with `sqlc.slice` (?), and when
-    -- positional parameters follow unnamed parameters great confusion is the
-    -- result. Making sure `sqlc.slice` is last is the only workaround I could
-    -- find, but it stops working if there are multiple clauses that need a
-    -- positional placeholder plus `sqlc.slice` like this one (the Postgres
-    -- driver supports a `queues_included` parameter that I couldn't support
-    -- here). The non-workaround version is (unfortunately) to never, ever use
-    -- the sqlc driver for SQLite -- it's not a little buggy, it's off the
-    -- charts buggy, and there's little interest from the maintainers in fixing
-    -- any of it. We already started using it though, so plough on.
-    AND (
-        cast(@queues_excluded_empty AS boolean)
-        OR river_job.queue NOT IN (sqlc.slice('queues_excluded'))
-    );
+    )
+    AND (/* TEMPLATE_BEGIN: queues_excluded_clause */ true /* TEMPLATE_END */)
+    AND (/* TEMPLATE_BEGIN: queues_included_clause */ true /* TEMPLATE_END */)
+    ORDER BY id
+    LIMIT @max
+);
Contributor

This comment applies to all drivers but I'm leaving it here because it doesn't seem like the pgx driver's JobDeleteBefore was updated in this PR yet. This query is currently targeted at the following index:

"river_job_state_and_finalized_at_index" btree (state, finalized_at) WHERE finalized_at IS NOT NULL

That will break if we're also filtering by queue, and this may turn into a sequential scan.

I haven't checked the other queries you've modified here, but this is an important consideration we'd need to be very careful of for all of them.

Contributor Author

@bgentry I ended up having the LLM run some minor benchmarking and checking index use. Here's what it produced:

  When QueuesIncluded is NULL (the default, no LeaderDomain set): The query uses
  river_job_state_and_finalized_at_index as a single index scan, exactly as before. No change in behavior.

  When QueuesIncluded is a concrete array (non-default LeaderDomain): The planner switches to a BitmapOr
  strategy with three legs:
  - cancelled leg: uses river_job_state_and_finalized_at_index on (state, finalized_at)
  - completed leg: uses river_job_prioritized_fetching_index on (state, queue)
  - discarded leg: uses river_job_state_and_finalized_at_index on (state, finalized_at)

  Then finalized_at and queue are applied as heap filters.

  This is actually reasonable planner behavior. For the completed leg, when filtering by specific queues, the
  planner correctly identifies that (state, queue) from the fetching index is more selective than (state,
  finalized_at) — it narrows to both the state and the small set of queues before checking finalized_at. The
  total estimated cost (21.76) is lower than the NULL-case cost (194.13).

Maybe read that over to make sure you're satisfied, but I think even with the added queues, we should still be good here.

Comment on lines +2 to +4
-- Alter `river_leader` to remove check constraint that `name` must be
-- `default`. SQLite doesn't allow schema modifications, so this redefines the
-- table entirely.
Contributor

SQLite doesn't allow schema modifications, so this redefines the table entirely.

🤯

Contributor Author

Yeah :( Awful.
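For readers unfamiliar with it, the SQLite "redefine the table entirely" dance generally follows the pattern below. This is a generic sketch of the technique only, not the PR's actual migration, and river_leader's real column list may differ:

```sql
-- Hypothetical sketch: SQLite can't ALTER a CHECK constraint away, so the
-- table is rebuilt without it and the rows are copied across.
CREATE TABLE river_leader_new (
    name       TEXT PRIMARY KEY,  -- CHECK (name = 'default') removed
    leader_id  TEXT NOT NULL,
    elected_at TEXT NOT NULL,
    expires_at TEXT NOT NULL
);
INSERT INTO river_leader_new (name, leader_id, elected_at, expires_at)
    SELECT name, leader_id, elected_at, expires_at FROM river_leader;
DROP TABLE river_leader;
ALTER TABLE river_leader_new RENAME TO river_leader;
```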

updatedSQL := sql
updatedSQL = replaceTemplate(updatedSQL, templateBeginEndRE)
updatedSQL = replaceTemplate(updatedSQL, templateRE)
updatedSQL = replaceTemplate(updatedSQL, templateBeginEndRE)
Contributor

What's going on with the reordering here?

Contributor Author

Just accidental again :/ Reverted.

Comment thread client.go
Comment on lines +243 to +245
// will continue to run on all leaders regardless of domain. If using this
// feature, it's a good idea to configure ReindexerTimeout on all but a
// single leader domain to river.NeverSchedule().

Now that we are running many pods of the same worker, do we treat a single pod as the "leader" here, or a group of workers? If we're saying that only one pod should be reindexing, what do we do when that pod dies? Or would having, say, 3 reindexer pods and many more that don't reindex also be alright?

@brandur brandur force-pushed the brandur-leadership-realms branch 2 times, most recently from 3bc1c32 to 1f12a24 Compare May 1, 2026 16:41
@brandur
Contributor Author

brandur commented May 1, 2026

Okay, I rebased this one to make the proposal a bit more current again in pursuit of coming up with a potential solution for #1105.

I had my LLM try to work through some alternative solutions with me, and though it came up with a variety of alternatives, it didn't come up with anything substantially better. It would be better if the client could pick a value for leader domain based on its configured queues, but as we discussed yesterday, this doesn't work well operationally in case a queue is added to or removed from a domain.

It did suggest renaming LeaderDomain to a less abstract alternative like ClientGroup, and I think that's worth considering, but even that I'm not sure is substantially better. ("Client group" is okay in that it suggests that clients might be grouped together, but I kind of like that "leader domain" suggests concretely that this is related to leadership elections specifically. Maybe something like "leader group" would be better though.)

@brandur brandur force-pushed the brandur-leadership-realms branch from 1f12a24 to be4d567 Compare May 1, 2026 18:12
@brandur brandur force-pushed the brandur-leadership-realms branch from be4d567 to a5c8ae8 Compare May 1, 2026 18:13